First Steps with Stan

Maud goes to Vegas

Andrew B. Collier

andrew@exegetic.biz | DataWookie

eRum (Budapest)

15 May 2018

Meet Maud

Pay Table: Cat and Mouse

description   symbols   payout
no win                       0
1x mouse      🐭             1
1x cat        🐱             2
2x mouse      🐭🐭            5
2x cat        🐱🐱           10
3x mouse      🐭🐭🐭          20
3x cat        🐱🐱🐱         100

Maud’s Burning Questions

  • What is the hit rate?
  • What is the RTP?
  • What is the hit rate for each winning combination?

Maud’s Data (by Session)

sessions %>% select(-details)
# A tibble: 100 x 7
   session spins  hits wager payout  hit_rate       rtp
     <int> <int> <int> <int>  <dbl>     <dbl>     <dbl>
 1       1     7     2    10      3 0.2857143 0.3000000
 2       2    19     7    30     29 0.3684211 0.9666667
 3       3    19     3    22      3 0.1578947 0.1363636
 4       4    26     7    30     13 0.2692308 0.4333333
 5       5    23     8    31     35 0.3478261 1.1290323
 6       6    20     8    26     12 0.4000000 0.4615385
 7       7    22     5    30     20 0.2272727 0.6666667
 8       8    22     4    25     10 0.1818182 0.4000000
 9       9    18     4    26      6 0.2222222 0.2307692
10      10    26     8    33     75 0.3076923 2.2727273
# ... with 90 more rows

Maud’s Data (by Spin)

(spin <- sessions %>%
   select(session, details) %>%
   unnest() %>%
   mutate(success = as.integer(payout > 0)))
# A tibble: 1,972 x 5
   session  spin wager payout success
     <int> <int> <int>  <dbl>   <int>
 1       1     1     1      1       1
 2       1     2     1      0       0
 3       1     3     1      0       0
 4       1     4     3      0       0
 5       1     5     2      0       0
 6       1     6     1      2       1
 7       1     7     1      0       0
 8       2     1     2      0       0
 9       2     2     2      0       0
10       2     3     1      1       1
# ... with 1,962 more rows

Q1: Hit Rate

Probability distribution for Bernoulli process:

\[ P(k | \theta) = \theta^k (1 - \theta)^{1 - k} \]

where

  • \(k = 1\) indicates success and \(k = 0\) indicates failure; and
  • the probability of success on any trial is \(\theta\).

Probability distribution for binomial process:

\[ P(k | n, \theta) = \binom{n}{k} \theta^k (1 - \theta)^{n - k} \]

where

  • \(k\) successes in \(n\) trials; and
  • the probability of success on any trial is \(\theta\).

For session \(i\) there were \(k_i\) hits from \(n_i\) spins.
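As a quick sanity check, the binomial density above can be evaluated with `dbinom()`. Using session 1 from Maud's data (2 hits in 7 spins):

```r
# Likelihood of session 1 (2 hits in 7 spins) at a few candidate
# values of theta.
theta <- c(0.1, 0.3, 0.5)
lik <- dbinom(2, size = 7, prob = theta)
round(lik, 3)
# 0.124 0.318 0.164 -- the data are most likely near theta = 0.3
```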

Bayesian Inference

Bayes’ Theorem \[ p(\theta|y, X) = \frac{p(y|X, \theta) p(\theta)}{p(y)} \propto p(y|\theta) p(\theta) \] where

  • \(y\) are observations;
  • \(X\) are predictors;
  • likelihood — \(p(y|X, \theta)\) is probability of data given parameters;
  • prior — \(p(\theta)\) is parameter distribution before data; and
  • posterior — \(p(\theta|y, X)\) is parameter distribution given data.
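The proportionality \(p(\theta|y) \propto p(y|\theta) p(\theta)\) can be made concrete with a simple grid approximation. A sketch, using the session 1 counts (2 hits in 7 spins) and a Beta(2, 2) prior:

```r
# Unnormalised posterior on a grid: binomial likelihood multiplied by
# a Beta(2, 2) prior, then normalised over the grid.
theta     <- seq(0.01, 0.99, by = 0.01)
posterior <- dbinom(2, 7, theta) * dbeta(theta, 2, 2)
posterior <- posterior / sum(posterior)
theta[which.max(posterior)]   # posterior mode, close to 1/3
```

By conjugacy this posterior is Beta(4, 7), so the grid mode should sit near its analytic mode of 1/3.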

Monte Carlo (Plain Vanilla Version)

Generate independent samples \(\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(M)}\).

  • Requires sampling directly from \(p(\theta^{(m)} | y, X)\), which is rarely possible in practice.

Markov Chain Monte Carlo (MCMC)

Generate a sequence of dependent samples: \(\theta^{(1)}\), \(\theta^{(2)}\), \(\theta^{(3)}\), … \(\theta^{(M)}\).

  • \(\theta^{(m)}\) depends on \(\theta^{(m-1)}\).
  • Need to have \(p(\theta^{(m)} | \theta^{(m-1)}, y, X)\).

Stan Workflow

  1. Choose a model.
  2. Write Stan program (likelihood and priors).
  3. Stan parser converts this to C++.
  4. Compile C++.
  5. Execute compiled binary.
  6. Evaluate results. Optionally return to 2.
  7. Inference based on posterior sample.

To use rstan you need a .stan file (the model) and a .R script (to prepare the data, run the sampler and process the results).

Stan Skeleton

data {
  int<lower = 0> N;
  int<lower = 0, upper = 1> y[N];
}
parameters {
  real<lower = 0, upper = 1> theta;
}
model {
  y ~ bernoulli(theta);  
}

library(rstan)

# Use all available cores.
#
options(mc.cores = parallel::detectCores())

trials <- list(
  N       = nrow(spin),
  y       = spin$success
)

fit <- stan(
  file    = "bernoulli.stan",
  data    = trials,
  chains  = 2,                         # Number of Markov chains
  warmup  = 500,                       # Warmup iterations per chain
  iter    = 1000                       # Total iterations per chain
)

class(fit)
[1] "stanfit"
attr(,"package")
[1] "rstan"
samples <- as.data.frame(fit)

head(samples)
      theta      lp__
1 0.3326573 -1228.633
2 0.3136229 -1226.911
3 0.3276020 -1227.864
4 0.3203996 -1227.155
5 0.2942303 -1228.577
6 0.3166354 -1226.968
nrow(samples)
[1] 1000

data {
  int<lower=0> N;
  int hits[N];
  int spins[N];
}
parameters {
  real<lower=0,upper=1> theta;
}
model {
  hits ~ binomial(spins, theta);       // Likelihood
  theta ~ beta(2, 2);                  // Prior
}

print(fit, probs = c(0.025, 0.5, 0.975))
Inference for Stan model: binomial-beta-prior.
2 chains, each with iter=1000; warmup=500; thin=1; 
post-warmup draws per chain=500, total post-warmup draws=1000.

          mean se_mean   sd     2.5%      50%    97.5% n_eff Rhat
theta     0.31    0.00 0.01     0.29     0.31     0.33   359    1
lp__  -1228.99    0.04 0.86 -1231.39 -1228.66 -1228.45   466    1

Samples were drawn using NUTS(diag_e) at Mon May 14 07:50:08 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Use summary() to get information for each chain.
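Because the Beta prior is conjugate to the binomial likelihood, this particular posterior is also available in closed form, which makes a nice cross-check on the Stan output. (The totals of 617 hits in 1972 spins are implied by Maud's overall hit rate; treat them as an assumption here.)

```r
# With a Beta(2, 2) prior and k hits in n spins, the posterior is
# Beta(2 + k, 2 + n - k) exactly.
k <- 617; n <- 1972
post_mean <- (2 + k) / (4 + n)
round(post_mean, 2)                                  # ~0.31
round(qbeta(c(0.025, 0.975), 2 + k, 2 + n - k), 2)   # ~0.29 and 0.33
```

Both the posterior mean and the 95% interval agree with the Stan summary above.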

Q2: RTP

data {
  int<lower=0> N;
  real rtp[N];
}
parameters {
  real<lower = 0> mu;
  real<lower = 0> sigma;
}
model {
  rtp ~ lognormal(log(mu) - sigma * sigma / 2, sigma);
}
generated quantities {
  real simulated[25];
  for (i in 1:25) {
    simulated[i] = lognormal_rng(log(mu) - sigma * sigma / 2, sigma);
  }
}
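The correction in the location parameter is deliberate: with meanlog = log(mu) - sigma^2 / 2, the parameter mu is the mean of the lognormal, since \(E[X] = \exp(\text{meanlog} + \sigma^2/2)\). A quick simulation confirms this (values roughly matching the fit below):

```r
# Under this parameterisation, mu is the mean of the lognormal.
mu <- 0.8; sigma <- 0.71
set.seed(42)
x <- rlnorm(1e6, meanlog = log(mu) - sigma^2 / 2, sdlog = sigma)
round(mean(x), 2)   # ~0.8
```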

Inference for Stan model: lognormal-rtp.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

        mean se_mean   sd   2.5%    25%    50%    75%  97.5% n_eff Rhat
mu      0.80    0.00 0.07   0.68   0.75   0.79   0.84   0.94  1601    1
sigma   0.71    0.00 0.05   0.62   0.68   0.71   0.75   0.83  1614    1
lp__  -16.16    0.03 1.01 -18.79 -16.56 -15.85 -15.44 -15.17  1520    1

Samples were drawn using NUTS(diag_e) at Mon May 14 07:46:34 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Q3: Hit Rate per Combination

Pay Table: Cat and Mouse

description   symbols   payout
no win                       0
1x mouse      🐭             1
1x cat        🐱             2
2x mouse      🐭🐭            5
2x cat        🐱🐱           10
3x mouse      🐭🐭🐭          20
3x cat        🐱🐱🐱         100
data {
  int<lower=0> N;
  real rtp[N];
}
parameters {
  real<lower=0, upper=1> theta[6];
  real<lower=0> sigma;
}
transformed parameters {
  real<lower=0> pay;
  pay = theta[1] * 1 +                 // Payline 1 -> 1x
        theta[2] * 2 +                 // Payline 2 -> 2x
        theta[3] * 5 +                 // Payline 3 -> 5x
        theta[4] * 10 +                // Payline 4 -> 10x
        theta[5] * 20 +                // Payline 5 -> 20x
        theta[6] * 100;                // Payline 6 -> 100x
}
model {
  rtp ~ lognormal(log(pay) - sigma * sigma / 2, sigma);
  theta[1] ~ beta(2, 2);               // Mode = 0.5
  theta[2] ~ beta(2, 4);               // Mode = 0.25
  theta[3] ~ beta(2, 5);               // Mode = 0.2
  theta[4] ~ beta(2, 8);               // Mode = 0.125
  theta[5] ~ beta(2, 10);              // Mode = 0.1
  theta[6] ~ beta(2, 16);              // Mode = 0.0625
}

                mean      se_mean          sd         2.5%       97.5%    n_eff      Rhat
theta[1] 0.140889558 1.453005e-03 0.091252684 0.0176434998 0.356934146 3944.185 0.9997016
theta[2] 0.067440693 6.854080e-04 0.043349006 0.0087506512 0.171524124 4000.000 0.9997912
theta[3] 0.028990953 2.894410e-04 0.018305855 0.0040694145 0.072763086 4000.000 0.9998409
theta[4] 0.014376898 1.439473e-04 0.009104029 0.0019740490 0.036289607 4000.000 1.0003147
theta[5] 0.007360153 7.254394e-05 0.004588082 0.0009349501 0.018455450 4000.000 1.0001251
theta[6] 0.001508806 1.453327e-05 0.000919165 0.0002091751 0.003683408 4000.000 0.9993676
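Multiplying each posterior mean hit rate by its payout gives the expected return contributed by each payline. A rough calculation using the rounded posterior means above:

```r
# Expected return contributed by each payline: hit rate times payout.
theta  <- c(0.1409, 0.0674, 0.0290, 0.0144, 0.00736, 0.00151)
payout <- c(1, 2, 5, 10, 20, 100)
round(theta * payout, 2)   # each payline contributes roughly equally
sum(theta * payout)        # ~0.86, broadly consistent with the Q2 RTP
```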

The 1x payout is triggered on average every 7 spins.

The 100x payout is triggered on average only every 663 spins.
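Both of those figures follow directly from the posterior means: the average number of spins between hits on a combination is the reciprocal of its hit rate.

```r
# Average spins between wins for each combination: 1 / theta.
theta <- c(0.140890, 0.067441, 0.028991, 0.014377, 0.007360, 0.001509)
round(1 / theta)   # 7 15 34 70 136 663
```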

Exit Maud!

Maud’s Bayesian Answers

  • What is the hit rate? 31.3%
  • What is the RTP? 79.9%
  • What is the hit rate for each winning combination?
    • frequent, low-paying combinations (every 7 spins for 1x)
    • rare, high-paying combinations (every 663 spins for 100x)

Bayes: What’s It Good For?

  • Models where we care about parameter values.
  • Models where we want to quantify parameter uncertainty.
  • Generate some derived quantity from the posterior.

Here’s the Kicker

What’s the probability of breaking even?

mean(simulated$rtp > 1)
[1] 0.25014

What’s the probability of doubling your money?

mean(simulated$rtp > 2)
[1] 0.05116
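These simulation-based probabilities can be sanity-checked against the fitted lognormal evaluated at the Q2 posterior means (mu = 0.80, sigma = 0.71). This is a point-estimate shortcut, so it ignores parameter uncertainty:

```r
# Tail probabilities of the fitted lognormal at the posterior means.
mu <- 0.80; sigma <- 0.71
meanlog <- log(mu) - sigma^2 / 2
round(1 - plnorm(c(1, 2), meanlog, sigma), 2)   # ~0.25 and 0.05
```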

Extra Stuff

Metropolis-Hastings Algorithm

  1. Randomly sample \(\theta^{(0)}\).
  2. Set \(i = 1\).
  3. Randomly sample a proposal \(\theta^{*}\) in the vicinity of \(\theta^{(i-1)}\).
  4. Sample \(u\) uniformly on \([0, 1)\).
  5. Accept or reject the proposal:

\[ \theta^{(i)} = \left\{\begin{array}{ll} \theta^{*} & \text{if } u \cdot p(\theta^{(i-1)}|y, X) < p(\theta^{*}|y, X) \\ \theta^{(i-1)} & \text{otherwise.} \end{array}\right. \]

  6. Increment \(i\).
  7. Return to 3.
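A minimal R sketch of this algorithm, targeting the hit-rate posterior from Q1 (assuming, as before, totals of 617 hits in 1972 spins and a Beta(2, 2) prior; the proposal step size of 0.02 is an arbitrary choice):

```r
# Random-walk Metropolis-Hastings for the hit-rate posterior.
set.seed(1)
log_post <- function(theta) {
  dbinom(617, 1972, theta, log = TRUE) + dbeta(theta, 2, 2, log = TRUE)
}
theta <- numeric(5000)
theta[1] <- runif(1)                         # random starting point
for (i in 2:5000) {
  proposal <- theta[i - 1] + rnorm(1, 0, 0.02)   # propose nearby value
  if (proposal > 0 && proposal < 1 &&
      log(runif(1)) < log_post(proposal) - log_post(theta[i - 1])) {
    theta[i] <- proposal                     # accept
  } else {
    theta[i] <- theta[i - 1]                 # reject: keep current value
  }
}
round(mean(theta[1001:5000]), 2)             # discard warmup; ~0.31
```

Working on the log scale avoids numerical underflow; the acceptance test is the same inequality as above, rearranged.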

Resources

Predictive Inference

Posterior Predictive Distribution

\[ p(\tilde{y}|y) = \int_\theta p(\tilde{y}|\theta) p(\theta|y) d\theta \]
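In practice the integral is approximated by simulation: draw \(\theta\) from the posterior, then draw \(\tilde{y}\) from the likelihood at that \(\theta\). A sketch for the hit rate, using the conjugate Beta posterior (with the assumed totals of 617 hits in 1972 spins) and a hypothetical 20-spin session:

```r
# Posterior predictive by simulation: integrate over theta by sampling.
set.seed(7)
theta_draws <- rbeta(4000, 2 + 617, 2 + 1972 - 617)    # posterior draws
y_tilde <- rbinom(4000, size = 20, prob = theta_draws) # predicted hits
round(mean(y_tilde) / 20, 2)   # ~0.31
```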

Poisson

# trials <- list(
#   N       = nrow(sessions),
#   spins   = sessions$spins
# )
# 
# fit <- stan(
#   file    = "poisson.stan",
#   data    = trials,
#   chains  = 4,
#   warmup  = 1000,
#   iter    = 2000
# )
# extract(fit)

# hist(extract(fit)$lambda)

Regression

# ggplot(sessions, aes(spins, wager)) + geom_jitter()

Regression Stan

data {
  int<lower=0> N;
  vector[N] y;                         // Total wager
  vector[N] x;                         // Number of spins (standardised)
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  y ~ normal(alpha + beta * x, sigma); // Likelihood
  //
  alpha ~ normal(0, 10);               // Prior (Normal)
  beta ~ normal(0, 10);                // Prior (Normal)
  sigma ~ cauchy(0, 5);                // Prior (Cauchy)
}

Regression Stan R

# summary(fit)

“… the theory of inverse probability is founded upon error and must be wholly rejected.”

Sir Ronald Fisher (1925)

\[ \text{hit rate} = \theta^* = \frac{\text{total hits}}{\text{total spins}} = \frac{\sum_i k_i}{\sum_i n_i} \]

with(sessions, sum(hits) / sum(spins))
[1] 0.3128803

\[ \text{hit rate for session } i = \theta^*_i = \frac{k_i}{n_i} \]

# A tibble: 1 x 2
  session_avg session_std
        <dbl>       <dbl>
1   0.3100018   0.1030164

95% confidence interval extends from 28.9% to 33.1%.
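That interval is the session average plus or minus roughly two standard errors of the mean, using the summary values above (n = 100 sessions):

```r
# 95% interval: session mean +/- 2 standard errors of the mean.
avg <- 0.3100018; std <- 0.1030164; n <- 100
round(avg + c(-2, 2) * std / sqrt(n), 3)   # 0.289 0.331
```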

Assuming that sessions \(i = 1, 2, 3, \ldots\) are independent:

\[ P(k | n, \theta) = \prod_i P(k_i | n_i, \theta) = L(\theta; n, k) \]

Log-likelihood for binomial process (multiple trials), dropping the constant \(\log \binom{n_i}{k_i}\) terms:

\[ LL(\theta; n, k) = \sum_i \left[ k_i \log{\theta} + (n_i - k_i) \log{(1 - \theta)} \right] \]

# A tibble: 6 x 2
  theta log_likelihood
  <dbl>          <dbl>
1  0.31      -1225.411
2  0.32      -1225.604
3  0.30      -1226.146
4  0.33      -1226.692
5  0.29      -1227.843
6  0.34      -1228.649
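The table can be reproduced from the totals alone, since the log-likelihood depends on the data only through \(\sum_i k_i\) and \(\sum_i n_i\) (617 hits in 1972 spins, as implied by the overall hit rate above):

```r
# Log-likelihood as a function of theta, using only the totals.
ll <- function(theta) 617 * log(theta) + (1972 - 617) * log(1 - theta)
round(ll(c(0.31, 0.32, 0.30)), 3)   # -1225.411 -1225.604 -1226.146
```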

ShinyStan

library(shinystan)
launch_shinystan(fit)

Diagnose -> PPcheck, “Posterior predictive check”: compare the distribution of the observed data with the replications. Do they look similar? “If our model is a good fit then we should be able to use it to generate data that looks a lot like the data we observed.”